Easily Adaptable Handwriting Recognition in Historical Manuscripts
Author
John Alexander Edwards III
Abstract
Doctor of Philosophy in Computer Science and the Designated Emphasis in Communication, Computation and Statistics, University of California, Berkeley. Professor David Forsyth, Co-Chair; Professor Jitendra Malik, Co-Chair.
As libraries increasingly digitize their collections, there are growing numbers of scanned manuscripts that current OCR and handwriting recognition techniques cannot transcribe, because the systems are not trained for the scripts in which these manuscripts are written. Documents in this category range from illuminated medieval manuscripts to handwritten letters to early printed works. Without transcriptions, these documents remain unsearchable. Unfortunately, with existing methods, a user must manually label large amounts of text in the target font to adapt the system to a new script. Some systems require that a user manually segment and label instances of each glyph. Others provide for less costly training, allowing a user to segment and label entire lines of text instead of individual characters. Still, the collections we consider are extremely diverse, to the extent that in some cases almost every document may be in a different style. Because of this, the cost of manually transcribing dozens of lines of text for each font is prohibitively high. In this dissertation, we introduce methods that significantly reduce the manual labor involved in training a character recognizer for new scripts. Rather than forcing a user to transcribe portions of each target document, our system leverages general...
Similar resources
Image Segmentation of Historical Handwriting from Palm Leaf Manuscripts
Palm leaf manuscripts were one of the earliest forms of written media and were used in Southeast Asia to store early written knowledge about subjects such as medicine, Buddhist doctrine and astrology. Historical handwritten palm leaf manuscripts are therefore important to anyone who studies historical documents, because much can be learned from them. This paper presents a...
Retrieving Historical Manuscripts using Shape
Convenient access to handwritten historical document collections in libraries generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Currently, extensive manual labor is used to annotate and organize such collections, because handwriting recognition approaches provide only poor result...
A Statistical Approach to Retrieving Historical Manuscript Images without Recognition
Handwritten historical document collections in libraries and other areas are often of interest to researchers, students or the general public. Convenient access to such corpora generally requires an index, which allows one to locate individual text units (pages, sentences, lines) that are relevant to a given query (usually provided as text). Several solutions are possible: manual annotation (ve...
Binarization-free Text Line Extraction for Historical Manuscripts
Large collections of old historical manuscripts, which contain valuable information about our cultural heritage, exist in libraries around the world. Recently, there has been much interest in their digitization for preservation reasons, since the quality of many of the available manuscripts has deteriorated from exposure to the environment. Digitization, though, is only the first step to make...
Off-line Arabic Handwritten Recognition Using a Novel Hybrid HMM-DNN Model
Automatic recognition of printed texts and manuscripts facilitates the entry of data into the computer and its digitization, and is a considerable aid to many applications. Research on automatic document recognition started decades ago with the recognition of isolated digits and letters, and today, due to advancements in machine learning methods, efforts are being made to iden...